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Abstract 

With tens of petaflops supercomputers already in operation and exaflops 
machines expected to appear within the next 10 years, efficient paraUel com- 
putational methods are required to take advantage of such extreme-scale 
machines. In this paper, we present a three-dimensional domain decomposi- 
tion scheme for enabling large-scale electronic calculations based on density 
functional theory (DFT) on massively parallel computers. It is composed of 
two methods: (i) atom decomposition method and (ii) grid decomposition 
method. In the former, we develop a modified recursive bisection method 
based on inertia tensor moment to reorder the atoms along a principal axis so 
that atoms that are close in real space are also close on the axis to ensure data 
locality. The atoms are then divided into sub-domains depending on their 
projections onto the principal axis in a balanced way among the processes. 
In the latter, we define four data structures for the partitioning of grids that 
are carefully constructed to make data locality consistent with that of the 
clustered atoms for minimizing data communications between the processes. 
We also propose a decomposition method for solving the Poisson equation 
using three-dimensional FFT in Hartree potential calculation, which is shown 
to be better than a previously proposed parallelization method based on a 
two-dimensional decomposition in terms of communication efficiency. For 
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evaluation, we perform benchmark calculations with our open-source DFT 
code, OpenMX, paying particular attention to the 0{N) Krylov subspace 
method. The results show that our scheme exhibits good strong and weak 
scaling properties, with the parallel efficiency at 131,072 cores being 67.7% 
compared to the baseline of 16,384 cores with 131,072 diamond atoms on the 
K computer. 

Keywords: First principles calculations; Linear scaling method; Krylov 
subspace method; Domain decomposition; Modified recursive bisection; 
Inertia moment tensor; FFT grid decomposition 



1. Introduction 

Density functional theory (DFT) [H El is widely regarded as one of the 
most powerful and popular methods available in computational materials 
science simulations. It has been applied to various systems of interest, from 
physics, chemistry, and materials science to biology, for exploring a wide 
range of material and biological properties, such as structural and optical 
properties, electric transport, chemical reactions, etc. Although the Kohn- 
Sham (KS) method offers a feasible approach to performing the calculation 
of the ground state properties of A^-body systems, its scaling property of 
0{N^) has posed great challenges in large-scale calculations. There has been 
considerable effort to realize large-scale DFT calculations that can be roughly 
classified into two major approaches: first, improve and implement highly 
efficient parallelization of conventional methods, and second, develop better- 
than-cubic scaling methods, ideally linear scaling methods. 

In the first approach, real-space methods are well-established and signif- 
icant progress in developing practical real-space methods and implementa- 
tions has been observed. Recently, a real-space DFT code has achieved an 
unprecedented performance on the K computer [4J . Real-space implementa- 
tions of DFT applying finite difference methods [5j and finite element meth- 
ods O 13 El [9] have also been extensively developed. On the other hand, in 
the second approach, a number of linear scaling methods have been proposed. 
Several survey reports on linear scaling methods can be found in literature 
[ini [m [121 US EJ . A popular and straightforward approach to linear scal- 
ing is the divide and conquer (DC) method [151 EHl llTj. Recursion method 
proposed by Ozaki et al. in [181 UHl ED] is another approach. Other linear 
scaling methods include the early work presented in [21j, methods based on 
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Wannier functions [22l [23l El EH] and density matrix [26l EZl EHl ES] , and 
methods for tight-binding systems [ 301 [3T] and for metaUic and non-metalhc 
systems [32l [33j . 

Our main concern in this paper is the development of an efficient three- 
dimensional domain decomposition method that is applicable to both con- 
ventional methods and linear scaling methods to enable large-scale electronic 
calculations based on DFT on massively parallel computers. In fact, our 
linear scaling DFT method based on Krylov subspace method has been de- 
veloped and introduced in [34j . Even though there exist a number of parallel 
implementations of linear scaling DFT [35l|36l[37j and domain decomposition 
methods based on the space-filling curve technique [38l [39], these methods 
may suffer from difficulty in applying to anisotropic structures. Considering 
the fact that we are now in the petafiops era with several supercomputers 
having a peak performance of tens of petafiops and exafiops machines are 
expected to appear within the next 10 years with millions of cores, gen- 
erally applicable and efiicient methods are required to take advantage of 
such extreme-scale machines and solve a wide variety of problems. In par- 
allelizaton of linear scaling methods, as our implementation is also based on 
atomic orbitals [40j and their linear combination [41^ ^J, a known problem 
arises due to the use of localized atomic basis functions with different spatial 
cut-off radius for different elements. The irregular sparse structure of the 
resultant matrices make their parallel multiplication implementation become 
more complicated, since they require careful mapping of the sparse matrix el- 
ements to processing cores, unlike finite element methods where the matrices 
are regular. In addition, the domain decomposition method should hold the 
locality of clustered atoms in real space and assign nearby atoms to proces- 
sors to minimize inter-node communications. Also, it should be applicable 
to any distribution patterns of atoms in real space. 

To address the issues, in this paper we propose a domain decomposition 
scheme, which is actually composed of two methods: one for decomposing 
the atoms and the other for partitioning the grids among the processes. 
In the former, we develop a modified recursive bisection method based on 
inertia tensor moment to reorder the atoms along a principal axis so that 
atoms that are close in real space are also close on the axis. Depending 
on their positions on the axis, the atoms are then divided into sub-domains 
in a balanced way with data locality conserved. The method should be 
able to work well with any number of atoms and processes. Meanwhile, 
in the latter, we introduce four data structures for the grid partitioning 



3 



that are carefully constructed to make data locality consistent with that of 
the clustered atoms, resulting in the minimization of data communications 
between the processes. We furthermore propose a decomposition method for 
solving the Poisson equation using three-dimensional (3D) FFT in Hartree 
potential calculation, which is proven to be more communication efficient 
than one- and two-dimensional decomposition methods. We also let the 
processes perform on-the-fly communications to reduce the memory usage. 

The remainder of the paper is organized as follows. Section 2 describes 
briefly the research background, including the KS equation and our linear 
scaling Krylov subspace method. The domain decomposition methods for 
atoms and grids are presented in Section 3. Section 4 details implemen- 
tation issues and the calculation flowchart. Section 5 analyzes benchmark 
results with a focus on linear scaling, and strong and weak scaling properties. 
Finally, we conclude our study in Section 6. 

2. Background 

For the sake of completeness and self-explanation, we briefly introduce 
the background of DFT by way of the well-known KS equation, and give an 
overview of our linear scaling Krylov subspace method, while highlighting 
data structures related to our parallelization scheme. 

2.1. Kohn-Sham equation 

In principle, DFT is a modeling method to determine the electronic 
ground state of A^-body systems of atoms and molecules based on function- 
als of electron density. Although DFT may have its root date back to the 
Thomas-Fermi model and Slater's Xa method derived from a simpliflcation 
of the Hartree- Fock method j43l [44] , its theoretical foundation was only es- 
tablished flrmly by the Hohenberg-Kohn theorems introduced in the 1960s 
[1]. The resulting KS equation is typically represented in the form of an 
eigenvalue equation as below [l5l US] . 

HKS(t>A^) = (1) 

where H^s is the KS Hamiltonian, and s^^ is the orbital energy of the corre- 
sponding KS orbital (/>zy(r). Consequently, the density n(r) of the system is 
defined as: 

occ. 

n{r) = 2j2€{r)Mr), (2) 

i=l 
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where the factor of 2 is due to spin multiphcity, and v runs for aU the occupied 
states. Although our discussion is hmited within non-spin polarization, the 
generalization to spin polarization is straightforward. 

In our implementation, we utilize atomic orbital basis functions, written 
as the multiplication of a spherical harmonic and a radial function R\ 

Mr) = Yr{r)R{r). (3) 

The basis functions have been carefully constructed and optimized for con- 
vergence and reduction of computational effort with high accuracy, as well 
as memory usage [42j . Most notably, the use of atomic-like orbitals perfectly 
matches the idea of linear scaling methods. They are strictly confined within 



a sphere and stored in a special structure described later in Section |3.2.2 
In the same section, another structure for storing all the grid points inside 
a parallelepipedon defined by the basis functions of all atoms distributed to 
the same process in the parallel implementation will also be presented. 
The Hamiltonian is expressed as: 

^KS = -^v2 + i;eff(r), (4) 
where VQf[{r) is the Kohn-Sham eflFective potential, given by: 

VeEir) = Ve^t[r) + Vii^,tTee[r) + ^^^^^ . (5) 

Here, the first term, Ve^t{r)^ is the external potential due to the interaction 
between electrons and nuclei and other external fields. The second term 
'^Hartree(^) IS the classical Hartrcc (Coulomb) interaction of the electron den- 
sity interacting with itself, given by 

VR^TtTee[r) = / ndr . (6) 

J \r - r \ 

In our implementation, the Hartree potential is found by solving the Pois- 
son equation using FFT. In parallel implementations, the grid points must 
be decomposed to the processes in a proper fashion to minimize the com- 
munication amounts incurred during the calculation. We will introduce our 
grid decomposition method and compare with other methods in terms of 



communication efficiency in Section 3.2 
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The last term in Eq. ^ is the exchange-correlation potential, where 
£'xc is the corresponding exchange-correlation energy that is the unknown 
in this KS approach. A number of methods have been proposed to yield 
approximations to this exchange-correlation energy, for instance, local den- 
sity approximation (LDA) and generalized gradient approximation (GGA) 
We will construct a particular structure for the calculation of this 



potential in Section 3.2.2 



The KS equation turns out to be in the form of one-body equations under 
the effective potential. Combined with the methods for approximating E'xc, 
the equation can be solved self-consistently with a resulting ground state 
density. Then the corresponding energy can be determined: 

EKs[n] =Ts + J n{r)ve^t{r)dr + ^HartreeM + E^c[n], (7) 

where Tg is the KS kinetic energy of non-interacting electrons, also expressed 
in terms of the KS orbit als: 



occ. 



T,^-y i <t>:\jH.dr. (8) 



2.2. Linear scaling Krylov subspace method 

The basic idea behind the Krylov subspace method [34] is to combine the 
DC method and the recursion method based on the Green's function. The DC 
method has been proven good for covalent systems but not for metals due to 
the difficulty of direct diagonalization for large truncated clusters, while the 
recursion method is known for its numerical instability in SCF calculations. 
The three methods are founded on the same fact that the charge density n 
can be given by summing up the local charge density of each site, which 
is actually evaluated with the density matrix p: 

(9) 

where i is the site index, and a is the orbital index of the basis functions. 

The density matrix p is in turn calculated by the one-particle Green's 
function: 

Aa,,7? = -l^^j G^a,Jp{E + ^0+)/ {^^^ dE , (10) 
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(1) Truncate in real space 

Figure 1: Linear scaling Krylov subspace method. 

where f{x) = 1/[1 + exp(x)] is the Fermi function, and 0^ is a positive 
infinitesimal. 

The problem now turns to how to evaluate the Green's function in linear 
scaling time, and hence each method has its own approach. The Krylov 
subspace method can be seen as a DC method defined in a Krylov subspace 
for reducing the size of the truncated clusters, as the Krylov subspace is much 
smaller than the original vector space. In this method, the total Green's 
function can be approximated by the sum of the local Green's functions 
associated with the central atom in each truncated cluster. 

Figure [T] demonstrates how the Krylov subspace method works in practice. 
First, in the Divide step, the original system is decomposed into smaller trun- 
cated clusters using a physical truncation scheme, described later in Section 
|4} The truncated clusters are assigned to different processes for processing 
in parallel. Then, in the Solve step, the Krylov subspace for each truncated 
cluster is generated independently on each process without any communica- 
tions. After that the Green's function associated with each truncated clusters 
is evaluated, and the back transform is performed to get the Green's func- 
tions represented by the original basis functions for calculating the density 
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matrix p which is then found by Eq. (10), foUowed by the local charge den- 
sity rii by Eq. Finally, in the Conquer step, the total charge density n is 
given by combining all the local charge densities (Eq. ([9])). The density is 
treated by regular mesh, and we will therefore introduce the grid decompo- 
sition method with four data structures for storing data on the grid points 
in Section [3^ The Krylov subspace method has been proven to be efficient, 
robust, could converge rapidly for both metals and insulators, and has been 
recently applied to a large-scale molecular dynamics simulation for electro- 
chemical systems [4Z| and structural prediction of precipitates of a carbide 
in bcc iron HSl. Interested readers are referred to [341 for details. 



3. Domain decomposition methods 

Domain decomposition should be performed in three dimensions, espe- 
cially for large-scale calculations with a large number of processes, consid- 
ering the scaling properties of communication amount and memory usage. 
One- and two-dimensional decomposition methods are only appropriate for 
up to a certain number of processes, and when that number exceeds the limit 
of one or two dimensions, the amount of communication and memory us- 
age will stay unchanged. In contrast, three-dimensional methods can realize 
much larger scales of calculations, as they are perfectly suited to the three- 
dimensional nature of systems' structure. In this section, we present our 
three-dimensional domain decomposition scheme, consisting of two methods: 
one for the atom level and the other for the grid level in three-dimensional 
space. 

3.1. Three-dimensional atom decomposition method 
3.1.1. Requirements 

In parallel implementations, atom decomposition among the processes is 
a very important first task that has a considerable impact on the overall 
performance. We consider a 3D domain decomposition method for atoms in 
a system, which should hold the locality of clustered atoms in real space with 
the following requirements. 

• Approximately the same computational amount: Each sub- 
domain should have approximately the same computational amount. 
This requirement is obvious for ensuring a good load balance among 
the processes. As atoms of different elements have different process- 
ing time, each atom should have a corresponding weight refiecting the 
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amount of computation. The only exception is the case of the first MD 
step or geometry optimization step, where aU the atoms have the same 
weight of 1. 

• Locality: Nearby atoms should be grouped together and assigned to 
the same process. This will reduce the amount of communication and 
assists the development of linear scaling methods, which are also essen- 
tially based on the idea of locality. 

• Applicable to any number of atoms and processes: Reahstic sys- 
tems can have any number of atoms, and can be solved by any number 
of processing cores available to maximize resource usage. Hence, in 
order to treat them eflFectively, the method should be able to work with 
any number of atoms and processes. 

• Applicable to any distribution pattern of atoms: Some specific 
systems have most atoms distributed around the center of mass, while 
some have atoms spread uniformly in the space, and others may have 
a majority of atoms lie along just one main dimension. Such irregular 
distributions of atoms should be considered in the method. 

• A well-defined algorithm: To be clear and easy to follow. 

Figure [2] exhibits an expected domain decomposition result for a system 
of 26 atoms, in 2D for the sake of simplicity, where nearby atoms are nearly 
equally allocated into four sub-domains. Given the requirements, a proper 
3D domain decomposition can be performed based on two ideas: modified 
recursive bisection and inertia moment tensor, which are presented subse- 
quently. 

3.1.2. Modified recursive bisection method 

The modified recursive bisection method is a useful tool to decompose 
the system with any number of processes and atoms, unlike the conventional 
recursive bisection method that is restricted to numbers equal to a power of 2 
|49j . The method involves constructing a binary tree, and each tree node has 
a number representing the number of processes treating the domain under 
that node. Leaf nodes have a number of processes valued at 1, and the 
number of leaf nodes is equal to the number of processes. In practice, the 
method works as follows. 
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Figure 2: An example of domain decomposition for atoms with each color representing a 
sub-domain. 

First, the original domain is divided into two sub-domains, each having 
approximately the same number of processes, which is about half of that 
of the original domain. Each sub-domain is then further divided into two, 
again with each having nearly the same number of processes. The division 
is repeated until the number of processes becomes 1. Once the division 
procedure has been completed, there are as many leaf nodes as processes. 

Figure |3] shows an example of the method where the number of processes 
is 19. It is also important to note that in the conventional recursive bisection 
method limited to the number of processes equal to a power of 2, the bisection 
is made so that equal numbers can be assigned to each sub-domain. However, 
the modified method bisects with weights as shown in the figure. 



19 









10 














9 












5 








5 






5 








4 






3 




2 




3 




2 


3 




2 




2 




2 




2 


ll 


ll 


ll 


2 


ll 


ll 


ll 2 


ll 


ll 


ll 


ll 


ll 


ll 


ll 



iFTl ipTl 

Figure 3: Modified recursive bisection method. 
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3.1.3. Inertia tensor moment 

If the modified recursive bisection method meets the requirement to work 
with any number of processes and atoms, the use of an inertia moment tensor 
for the bisection in the method ensures the locahty of the domain decompo- 
sition. The inertia moment tensor is apphed to calculate a principal axis for 
each sub-domain during the decomposition process described above. Then 
atoms in the corresponding sub-domain are reordered from three dimensions 
to one dimension by projecting them onto the principal axis. This reordering 
will also make it much easier to divide the atoms into two sub-domains to 
fit the structure of the binary tree. The inertia tensor moment approach is 
expected to work well, as atoms that are close in real space are also close on 
the principal axis yielded by the inertia tensor. Figure [4] illustrates the idea 
of 3D-to-lD reordering of atoms. 



Three dimensions One dimension 

, Principal axis 



Reorder) %% m % % 




Figure 4: Reordering of atoms by a principal axis found by the inertia tensor moment. 
Mathematically, the equation of the principal axis can be written as: 

X Cx ~\~ CLxt 

y = Cy + ayt (11) 
z = Cz + a^t, 

where {Cx^ Cy^Cz) is the center of mass of the system, defined as the average 
of the atoms' positions, x^, weighted by their weights, Wi^ with the total 
number of atoms, A^^^: 

~ Na ^^=1 ^^^^ 
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Projection ti of an atom i on the principal axis is given by: 



U = a^^ixi - Cj,) + ay{yi - Cy) + a^(zi - C^). (13) 
We then define a function F as: 

F = Y,^,e,-\{\a\'-l). (14) 

i 

The function F is now the target for maximization to find the possible longest 
principal axis onto which all atoms can be projected. The derivatives of F 
over the coordinate axes: 

'l^-'f-'f-' (15) 

oax oay oaz 



lead to an eigenvalue equation: 







( ax 


ay 




ay 






\az 



(16) 



T is the tensor of inertia about the center of mass with respect to the xyz 
axes, written in a matrix form as: 



where 



y=yi- Cy (18) 

Zi Zi — C 



z • 



Finally, the principal axis is found by solving the eigenvalue problem with 
the inertia tensor. After solving the eigenvalue problem, we obtain three 
eigenvalues and the corresponding eigenvectors. Among the three states. 



we choose the state that maximizes the function F in Eq. (14). Also, the 
derivations imply that there are many ways to properly decompose a system 
by changing the definition of the function F. In our method, it is expressed 
as a maximization problem to find the correspondingly longest principal axis 
so that all atoms can be projected onto it. 
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Figure 5: Atom decomposition with the modified recursive bisection method and the 
principal axis. 
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Figure [5] exemplifies the operation of our atom domain decomposition 
scheme based on the modified recursive bisection method and the principal 
axis. The binary tree is constructed with the root representing the original 
domain consisting of 26 atoms. The principal axis of the original domain 
is then found, based on which the domain is divided into two sub-domains 
with each having 13 atoms. The sub-domains are assigned to the two child 
nodes of the tree and the process of finding the principal axis and dividing the 
domain is repeated for each sub-domain. The domain on the left child node is 
divided into two sub-domains with one having 7 atoms and the other having 
6 atoms, and so is the domain on the right child node. The decomposition 
is performed recursively until there are as many sub-domains as processes. 

3.1.4' Load balancing 

As different atoms may have different processing time depending on their 
properties, each atom is associated with a weight for load balancing. As a 
result, both the number of atoms and weights are considered in the decom- 
position method in an effort to distribute the amount of computation equally 
among the processes. The weight of an atom i is defined as: 

Etimej 

Mtime ^ ^ 

where Etimei is the processing time of the atom i in the previous MD step, 
and Mtime is the longest processing time of all atoms. 

It should be noted that the weight definition has an important impact 



upon reordering the atoms. The weight defined in Eq. (19) tends to position 
light atoms at around the center, while heavy atoms tend to move away from 
the center. As the bisection for dividing the domain is usually found near 
the center, the decomposition is indeed more precise when light atoms are 
also positioned around it. Therefore, our method based on the inertia tensor 
moment fulfils the requirement for load balancing and locality. 

3.1.5. Algorithm for atom decomposition 

Figure |6] outlines the algorithm for the atom decomposition method with 
a combination of the modified recursive bisection method and the inertia 
tensor. It starts with the root node of the binary tree representing the 
original domain as the current node. If the number of processes is 1 or the 
number of atoms is 0, the current node is the leaf node and the algorithm will 
stop. Otherwise, it will find the center of mass with weights of the current 
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domain (Eq. (|12|)), and compute the inertia tensor based on Eq. (17). After 
that, the inertia tensor is diagonahzed to obtain the principle axis. Now the 
projection ti of each atom onto the principal axis can be determined using 



Eq. (13). They are then sorted to find the bisection for dividing the current 



node into two child nodes based on Eq. (19) where the sum of weight is 



nearly equivalent to each other. The algorithm repeats with each child node 
until reaching the leaf node, where there is only one assigned process or no 
atom available. 



START: consider the root 
[ node representing the original 
, domain as the current node , 



Np= number of 
processes 



Na= number of 
atoms 




Find the center of 
mass (Eq. (12)) 



Compute the inertia 
tensor (Eq. (17)) 



Diagonalize the 
inertia tensor 



Get the principle axis 

(ax, ay, az) 



Calculate the 
projection ti for each 
atom (Eq. (13)) 



Sort the projections ti 



Find the bisection 
and divide into two 
child nodes 



Assign Npl and go to 
the left child node 



Assign Npr and go to 
the right child node 



Figure 6: The algorithm for decomposing atoms. 



3.2. Three-dimensional grid decomposition method 

As the regular mesh technique is used in our implementation [50], after 
the atoms have been distributed to the processes, the grid points belonging to 
their basis functions must also be allocated to the corresponding ones. Once 
the density matrix p has been calculated by Eq. (10), the charge density can 
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be found by Eq. ^ with the basis functions using regular mesh. As the 
Hartree potential needs to be calculated by FFT using the whole system, the 
allocation of the grids to the processes should be carried out effectively so 
that the communication amount among them is as small as possible. 

3.2.1. Requirements 

In parallel with the atom decomposition method, a 3D domain decompo- 
sition method for grids is also developed with the following requirements. 

• Approximately the same number of grid points: Similar to the 
case of atom decomposition, processes should be assigned nearly the 
same number of grid points so that they can perform approximately 
the same computational amount. 

• Almost linear scaling in memory usage: Memory usage is always 
a source of concern in DFT-based calculations, as a large amount of 
memory is demanded for storing a huge number of grid points. As 
such, the amount of memory required for serial calculations should be 
only linear to the system size, and should remain a constant for parallel 
calculations as a function of the numbers of atoms and processes when 
the same number of atoms is allocated to each process. 

• Locality: Grid points within the sphere of a basis function (Eq. ^) 
belonging to an atom that is allocated to a process should be assigned 
to the same process with the atom. The locality should also be held 
when distributing the grids to the processes. This will have two-fold 
benefits: minimize the amount of communication and the amount of 
memory necessary for storing data on the grid points overlapping with 
those of the basis functions for all atoms in one process. 

3.2.2. Data structures for MPI parallelization 

To comply with the requirements, here we define four data structures 
to make the data locality consistent with that of the clustered atoms for 
minimizing the amount of communication between the processes, which are 
shown in Fig. [7| 

• Structure A: This structure stores all the grid points inside the sphere 
of a basis function. 
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Structure A 




Structure B 
(case of abc) 




Structure C 




Structure D 



Figure 7: Data structures for MPI parallelization. 



• Structure B: This structure stores all the grid points allocated to 
a specific process. Here we propose a two-dimensional decomposition 
method for parallel 3D FFT, where the distribution of the grid points 
is carried out in a two-dimensional grid defined by the first two di- 
mensions. Therefore, the structure B may exist in different forms by 
changing the order of distribution. For example, in case of abc dis- 
tribution, the grid points on the a6-plane with the c axis are divided 
over the processes in ascending order of their a and b coordinates so 
that each process has approximately the same number of grid points 
on the a6-plane. The grid points extending along the c direction that 
have the same a and b coordinates are also assigned to the same pro- 
cess. Assume that the numbers of grids of the a-, 6-, and c-axes 
are A^a, and Ac, respectively, the number of processes is A^, and 
myid is the process's ID. In our method, the process with myid will 
be assigned the grid points from the coordinate of (as,6s,0), where 



a^A'b + bs 



myidxN^Ni,^Np-l 



to the coordinate of {ae^be^N^ — 1), 
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where ttgA^b + be 



in ascending order of 



their a, 6, and c coordinates. As a result, our decomposition method 
can be viewed as a hybrid method between one-dimensional decompo- 
sition and two-dimensional decomposition, since it is the same as the 
former for up to a certain number of processes, and is gradually akin 
to the latter for larger numbers of processes. Detailed analysis on the 



communication amount will be given later in 3.2.3 



• Structure C: The structure stores all the grid points inside a paral- 
lelepipedon defined by the basis functions of all atoms allocated to one 
process, paying attention to the cutoflF radius of the basis functions. 

• Structure D: This structure can be seen as an extension of the struc- 
ture B(a6c) because it is created by adding some buffer regions around 
the structure B. Strictly speaking, this structure is only necessary for 
GGA ^51j as it requires neighboring grid points for calculating the gra- 
dient of the charge density by a finite difference method. Meanwhile, 
LDA j52] can be performed with the structure B without having to 
refer to the structure D. 



The introduction of the data structures makes it more straightforward in 
the calculations. For example, one can easily find a specific basis function 
by referring to the structure A. As can be seen subsequently, the structure 
B is related to a two-dimensional decomposition method for FFT that is 
more efficient than previously proposed methods in terms of communication 
amount. 

3.2.3. MP I communication analysis in the domain decomposition method for 
3D FFT 

In our two-dimensional decomposition method for parallel 3D FFT, as 
different distributions of the structure B are involved in the calculations in 
different orders, the order of reference may have an impact on communica- 
tion amount. Figure |8] shows an analysis on the amount of communication to 
transpose the structure B from one to another depending on the number of 
processes, in comparison with a traditional one-dimensional decomposition 
method and a two-dimensional decomposition method ^53j . Interestingly, the 
amount of communication is grouped into only two patterns of communica- 
tion in our method, and there is one pattern actually better than the other. 
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Figure 8: MPI communication analysis in the parallelization of 3D FFT for (a) our method 
and (b) other one- dimensional method and two-dimensional method ^53^ with N x N x N 
grid points. 
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In general, we can choose any of the four orders leading to the pattern with 
the smaller amount of communication, for instance abc cab cba^ or 
abc cba cab. However, we choose the order of abc cab cba 
to keep consistency with our current implementation of other functionalities 
such as non-equilibrium Green's function (NEGF) method. 
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Figure 9: Comparison of communication amount in the parallelization of 3D FFT in case 
of 64 X 64 X 64 grid points {N = 64). The inset enlarges the left part of the main graph. 

Figure [9] shows a comparison among our proposed method and the two 
other methods in terms of communication amount in case of 64 x 64 x 64 
grid points. Certainly, similar results can be produced for smaUer and larger 
numbers of grid points. In this case, the one-dimensional method is able to 
work only along one dimension and is limited to 64 processes, while the two- 
dimensional method always decomposes the domain in two dimensions, even 
for fewer than 64 processes. In contrast, our method offers a combination of 
these methods, as it partitions only along one dimension when the number of 
processes is up to 64 on condition that it is a divisor of 64, and decomposes 
in two dimensions while still starting from one dimension for larger process 
numbers. In other words, our method is based on a row-wise decomposition, 
as against a square decomposition approach taken in [53j. In doing so, most 
of communication amount is incurred when transferring from abc to ca6, and 
a majority of data are reused when transferring from cab to cba with just a 
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small amount of communication. As a result, up to 64 processes, our method 
works in the same fashion as the one-dimensional method provided that 64 
is a multiple of the number of processes, and is about 60.0% to 77.8% better 
than the two-dimensional method with 64 x 64 x 64 grid points. Beyond 
this point, the one-dimensional method is no longer applicable, while the 
two other methods can operate until reaching the limit of 64 x 64 = 4096 
processes. The difference in performance gradually decreases, however, and 
eventually becomes from 2048 processes. 

3.2.4' Data structures and calculation flow 

The relationship between the data structures and the calculation flow is 



depicted in Fig. 10, The calculation of charge density (Eq. ([9])) is performed 
with the structures A, B, and C in order of appearance. The Hartree potential 
(Eq. ([6])) is calculated by solving the Poisson equation using FFT for the 
whole system with diflFerent distributions of the structure B in the order 
of abc cab cba cab abc. The exchange-correlation potential 
(the last term in Eq. ^) is determined from either the structure C or the 
structure D depending on the approximation method in use. The structure 
B is also utilized for performing the charge mixing {cab cba abc)^ 
and for calculating the total energy (Eq. It should be noted that 

communication is required when the calculations move from one structure to 
another. 



4. Implementation 



Figure [TT] illustrates the implementation flowchart in terms of the com- 
putational flow that consists of three key steps: atom decomposition, grid 
decomposition, and 0{N) calculation. The flrst two steps are common, and 
applicable to other schemes such as conventional DFT calculations, NEGF 
method, etc., while the last step is specially for our linear scaling Krylov 
subspace method. 

Step 1 Atom decomposition. The atoms are allocated to the processes by 
the proposed atom decomposition method based on a combination 
of the modifled recursive bisection method and inertia tensor. The 
algorithm describing its operation is detailed in section [3.1.5 



Step 2 Grid decomposition. The grid points are decomposed using our 
grid decomposition method. It constructs the structures A, B, C, and 
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(a) Data structures of grid decomposition. 
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Figure 10: (a) Data structures and (b) calculation flow. 



D based on their definition. The data structure for transferring Pi{r) 
from the structure A to the structure B when the charge density p(r) 
is calculated in the structure B using Pi{r) is constructed. Similarly, 
the data structure for transferring p{r) from the structure B to the 
structure C when p(r) is constructed in C using p(r) stored in B, 
and the data structure for transferring p(r) from the structure B 
to the structure D for the case that p(r) is constructed in D using 
p{r) stored in B are constructed. In order to calculate the Hartree 
potential by FFT with different distributions of the structure B, the 
data structures for MPI communications in FFT from abc to cab and 
from cab to cba are also built. It is worth mentioning that we employ 
non-blocking MPI routines MPI_Isend() and MPI_Irecv() for point-to- 
point communications between pairs of processes to cut the number of 
communication steps between the processes and hence, reducing their 
waiting time. This implementation is definitely much more efficient 
than posting pairs of blocking MPI_Send() and MPI_Recv(). 
Step 3 0{N) calculation. The linear scaling Krylov subspace method is 
implemented in this step. Although the atoms have already been al- 
located nearly equally to the MPI processes in Step 1, here we further 
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Figure 11: Computational flow. 

decompose the allotted atoms to the processing cores in each MPI pro- 
cess using OpenMP thread parallelization in the hybrid MPI/OpenMP 
implementation. 

Step 3a System truncation: Physical truncation scheme [19] is em- 
ployed to divide the large system into smaller truncated clus- 
ters. It collects all the neighboring sites within a sphere with 
a certain cutoff radius to construct the corresponding clus- 
ters. 

Step 3b Construction of cluster Hamiltonian: The Hamiltonian 
of each truncated cluster is constructed in the conventional 
fashion [50j. The matrix elements in each site can be cal- 
culated independently in parallel, and the cluster Hamilto- 
nian is built by gathering all the elements from the processes 
through communications. Also, the Hartree potential is de- 
termined from all the contributions from the truncated clus- 
ters to ensure high accuracy. 
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Table 1: Software configuration. 



Parameter 


Setting 


OpenMX 


3.637 (developer release) 


level, of. stdout 


1 


level. of.fileout 





scf.XcType 


GGA-PBE 


scf.maxlter 


10 


scf.EigenvalueSolver 


krylov (0(N) method) 


scf.ProExpn.VNA 


off 


MD.Type 


nomd 



Step 3c Generation of Krylov subspace: For numerical robust- 
ness, the Krylov subspace of each truncated cluster is gen- 
erated only at the first SCF step, and is re-utilized in sub- 
sequent steps [34j. The Krylov subspace generation is inde- 
pendent and performed in parallel. 

Step 3d Construction of effective Hamiltonian: The effective 
Hamiltonian is constructed by determining the short and 
long-range contributions, which can be performed in parallel 
on each process. For truncated clusters with a large buffer 
region, the long-range contribution is calculated only at the 
first SCF step and re-used in subsequent steps. 

Step 3e Self-consistency: The standard eigenvalue problem with 
the effective Hamiltonian is solved, and the eigenvectors are 
obtained by performing the back transformation to calculate 
the charge density. After that, a common chemical potential 
that conserves the total number of electrons for all the trun- 
cated clusters is determined by a bisection method, where 
MPI communication is performed using MPLAUreduce [34j . 

5. Benchmark results 

5.1. Software and system configuration 
5.1.1. OpenMX 

All the calculations are performed using OpenMX [54J , which is an open 
source parallel software package developed based on DFT, norm-conserving 
pseudopotentials, and pseudo-atomic localized basis functions. In particular, 
we use OpenMX version 3.6, release 37 (developer release) with a focus on 
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Table 2: System configuration. 



Parameter 


Setting 


Number of cores 


XT5 and FXIO: Up to 2,048 
K computer: Up to 131,072 


Number of threads 


XT5: 1 (flat MPI), and 4 and 8 (hybrid) 
FXIO: 1 (flat MPI), and 8 and 16 (hybrid) 
K computer: 8 (hybrid) 


Number of atoms 


XT5 and FXIO: Up to 65,536 
K computer: Up to 262,144 



the linear scaling Krylov subspace method. The number of SCF iterations 
is set at 10 for each calculation with no geometry optimization. In addition, 
GGA is employed to calculate the exchange-correlation potential, a required 
energy cutoff of 150 Ry is specified in the real space grid techniques for 
numerical integrations and solving the Poisson equation using FFT, while 
the cutoff energy is determined by the unit cell size to minimize the cutoff 
energy difference in the a—, 6—, and c— axes, and the electronic temperature 
is set at 300 K for counting the electron number. With regard to the Krylov 
subspace method, the radius of a sphere centered on each atom is fixed at 
6 Ang, and the dimension of the Krylov subspace in each truncated cluster 
is set at 400. As a result, the number of atoms in the truncated cluster 
is 159 with the diamond structure. With these settings in our method, we 
confirm that the absolute error in the total energy is 0.00072 Hartree/atom 
in self-consistent calculations with the diamond structure, in comparison to 
the /c-space conventional method performed with a large number of k points. 
Table [T] summarizes some important parameters. 

5.1.2. System configuration 

We perform the series of benchmark calculations on three different ma- 
chines: Cray XT5, Fujitsu FXIO, and the K computer. Each XT5 node 
is composed of two quad-core AMD Opteron 2.4GHz with 16GB memory, 
coupled with Cray SeaStar2+ interconnect, while each FXIO node has one 
1.848GHz 16-core SPARC64 IXfx processor, 32GB memory, and Tofu in- 
terconnect. Each K computer node contains one 2.0GHz 8-core SPARC64 
VHIfx processor, 16GB memory, and the same Tofu interconnect as FXIO. 
We use up to 2,048 cores on the first two machines and 131,072 cores on the 
K computer. On the other hand, as OpenMX is capable of executing in both 
flat MPI and hybrid MPI/OpenMP modes, several combinations of the num- 
ber of processes and threads are applied to particular system architecture. 
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Specially, the calculations are carried out with 1 thread (flat MPI), 4 and 8 
threads (hybrid) on XT5, 1, 8, and 16 threads on FXIO, and 8 threads on 
the K computer. The system conflguration is shown in Table [2} 

5.1.3. Benchmark system 

We mainly employ the diamond structure, which is well-suited to bench- 
mark calculations due to its uniformity, with the basis function of C5.0-52]92, 
where C represents the atomic symbol, 5.0 is the cutoff radius (Bohr) in the 
generation by the conflnement method, and s2p2 indicates that two radial 
functions are used to expand the basis functions for s- and p- orbitals, re- 
spectively. The numbers of atoms used on XT5 and FXIO are 2,048, 4,096, 
8,192, 16,384, 32,768, and 65,536 although we conflrmed that systems of up 
to 131,072 atoms are runnable on XT5. For the K computer, we test with 
systems of up to 262,144 atoms. The number of atoms is also shown in Ta- 
ble [2j Furthermore, in order to perform test calculations with a structure 
different from the uniform diamond structure, we choose the long deoxyri- 
bonucleic acid (DNA) structure consisting of cytosines and guanines with 
2,600, 13,000, and 26,000 atoms, and the basis functions of H7.0-52]9l, C7.0- 
s2p2dl, N7.0-52p2rfl, O7.0-52p2rfl, and P9.0-52j92rfl. 

5.2. Results 

To minimize the impact of system performance variations, each calcu- 
lation is executed flve times, except for the K computer, and the average 
values are taken as the results of the calculation. The following subsec- 
tions present the calculation results, divided into a demonstration of the 3D 
atom decomposition method, linear scaling property by the Krylov subspace 
method, strong and weak scaling properties with the diamond structure, 
strong scaling property with the DNA structure, and strong scaling property 
of sub-routines, followed by an analysis on memory usage. 

5.2.1. 3D atom decomposition 



Figure [12] demonstrates the operation of our 3D atom decomposition in 
the cases of 16,384 diamond atoms and 19 MPI processes, and multiply- 
connected carbon nanotubes consisting of 564 carbon atoms with 8 and 16 
MPI processes, where atoms of the same color are allocated to the same MPI 
process. In the diamond case, even though 19 is a prime number meant to 
pose challenges to general decomposition schemes, our method proves to be 
able to work efficiently, and atoms are practically localized to the process 
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(a) Diamond: Top view. (b) Diamond: Side view. 




(c) Multiply-connected carbon nanotube: 8 MPI processes. 




(d) Multiply-connected carbon nanotube: 16 MPI processes. 



Figure 12: 3D atom decomposition examples: 16,384 diamond atoms and 19 MPI processes 
with (a) top view and (b) side view, and multiply-connected carbon nanotubes consisting 
of 564 carbon atoms with (c) 8 MPI processes and (d) 16 MPI processes. Atoms of the 
same color are allocated to the same MPI process. 
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holding them. Similar atom locality conservation is also obtained with the 
multiply-connected carbon nanotubes, where nearby atoms are distributed 
to the same process. This is of great importance to reduce communication 
amount and memory usage for performance enhancement. Although our 
examples are limited to up to 19 processes, it is guaranteed that the systems 
are well divided for many processes, since the decomposition is recursively 
applied to sub-systems. 

5.2.2. Linear scaling property 

With the 0{N) Krylov subspace method, the elapsed time should be 
approximately linear to the number of atoms when the number of cores is 
fixed. Figure [13] presents the relationship between the elapsed time scaling 
factor and the number of atoms with a fixed number of cores of 2,048 on XT5 
(a), and 16,384 cores on the K computer (b) with the diamond structure. The 
relationship appears to be nearly linear: an increase in the number of atoms 
by a factor of 2 leads to an increase in the elapsed time by a factor of a little 
bit less than 2 in all three cases of 1, 4, and 8 threads. In case of 65,536 atoms 
in Fig. [T3l(a), the largest and smallest elapsed time scaling factors are 13.2 (4 
threads) and 12.1 (1 thread), respectively, as opposed to the linear factor of 
16 against the baseline of 4,096 atoms. The same behavior can be observed on 



the K computer, as shown in Fig. 13(b). The reason behind the better-than- 
linear elapsed time scaling factors is a higher computation to communication 
ratio with a higher number of atoms per core. When the number of cores is 
fixed and the number of atoms is increased, the computational amount per 
core will also be increased, leading to a higher efficiency. For instance, with 
a fixed number of 16,384 cores on the K computer (Fig. [T3|3), the number 
of atoms per core in case of 16,384 atoms is only 1, while this number is 
8 for 131,072 atoms, resulting in the computational amount per core with 
131,072 atoms being 8 times larger than that with 16,384 atoms. Suppose the 
communication amount stays almost unchanged for a fixed number of cores, 
the computation to communication ratio with 131,072 atoms would also be 8 
times higher than that with 16,384 atoms. Such a linear scaling property is 
definitely essential to enable extreme-scale applications that require hundreds 
of thousands to millions of atoms. The property will also be demonstrated 
with the DNA structure in I5.2.4[ 
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Figure 13: Linear scaling property with the diamond structure. 
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(a) Strong scaling with 16,384 atoms. 
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(b) Weak scaling with 16 and 32 atoms per core. 



Figure 14: (a) Strong and (b) weak scaling on XT5 with the diamond structure. 
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(a) Strong scaling with 16,384 atoms. 
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(b) Weak scaling with 16 and 32 atoms per core. 



;ure 15: (a) Strong and (b) weak scaling on FXIO with the diamond structure. 
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(a) Strong scaling with 131,072 atoms. 
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(b) Weak scaling with 2 and 4 atoms per core. 



Figure 16: (a) Strong and (b) weak scaling on the K computer with the diamond structure. 
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5.2.3. Strong and weak scaling with the diamond structure 



Figures [T4| [15] and [T6| respectively, show the strong scahng property in 
terms of paraUel efficiency and elapsed time, and weak scaling property in 
terms of elapsed time on XT5, FXIO, and the K computer with the diamond 
structure. In strong scaling calculations, the number of atoms allocated to 
one processing core is decreased with the increasing number of cores, resulting 
in a reduction in elapsed time. The strong calculations are performed with 
16,384 atoms on both XT5 and FXIO from 256 to 2,048 cores, each with 1, 
4, and 8 threads on XT5, and with 1, 8, and 16 threads on FXIO. On the 
K computer, the strong scaling results are taken with 131,072 atoms and up 
to 131,072 cores in the hybrid mode (8 threads). The parallel efficiency of 
strong scaling is given by the following equation: 

Pc = ^ X 100%, (20) 

where Pc is the parallel efficiency of the target calculation, Tb and A^b are the 
elapsed time and number of cores of the base calculation, respectively, and 
Tq and A^c are the elapsed time and number of cores of the target calculation, 
respectively. The base calculation used the equation is the case of 256 cores 
with XT5 and FXIO, and the case of 16,384 cores with the K computer, the 
lowest numbers of cores. 

On the other hand, weak scaling calculations demand the number of 
atoms allocated to one processing core be fixed when changing the numbers 
of cores and atoms to maintain a constant elapsed time in the ideal case, 
due to a constant computational amount per core. The number of atoms 
per core in the weak calculations is set at 16 and 32 from 128 cores to 2,048 
cores on XT5 and FXIO, and 2 and 4 from 4,096 cores to 65,536 cores on the 
K computer. For example, on XT5 and FXIO, in case of 16 atoms per core 
and 128 cores, there are 2,048 atoms to be processed, and in another case of 
2,048 cores and 32 atoms per core, we have to calculate 65,536 atoms. 

We note several observations. First of all, both fiat MPI and hybrid 
MPI/OpenMP modes exhibit good strong scaling property, with the parallel 
efficiency ranging from 68.8% to 96.4% on XT5 (Fig. [l4|(a)), and from 59.1% 
to 94.1% on FXIO (Fig. [T5|(a)). On the K computer, the parallel efficiency at 
131,072 cores is 67.7% compared to the baseline of 16,384 cores (Fig. [T6|(a)). 
Also, the parallel efficiency of the hybrid mode tends to be higher than that 
of the fiat MPI mode, especially with the largest number of cores, probably 
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due to a smaller number of MPI processes, accompanied by a smaller amount 
of communication, for the same number of cores. Using 2,048 cores in the 
fiat MPI mode, the calculation of 16,384 atoms could be finished in less than 
250 seconds on XT5. 

For weak scaling on XT5 (Fig. [T4|(b)), the elapsed time increases grad- 
ually with the number of cores due to a growing communication amount 
caused by more MPI processes. Nevertheless, the increase rate is quite small, 
ranging from 14.6% (1 thread with 32 atoms/core) to 33.8% (8 threads with 
16 atoms/core), when the number of cores is raised by a factor of 16 from 
128 to 2,048. More importantly, the eflSciency maintains its property when 
the calculation scale is enlarged gradually from 2,048 to 32,768 atoms (16 
atoms/core), and from 4,096 to 65,536 atoms (32 atoms/core). This observa- 
tion suggests that we can expect similar performance for even larger scales, 
as shown on the K computer (Fig. [T6|(b)). On FXIO (Fig. [T5|(b)), unfor- 
tunately we experience some sudden jumps in elapsed time with the hybrid 
mode, while the fiat MPI mode shows a very similar scaling property to that 
on XT5 with a solid performance (in the same Fig. [T5|(b)). 

In comparison between the fiat MPI and hybrid modes, the elapsed time 
of the fiat MPI mode is always lower than that of the hybrid mode with 4 
threads, which in turn is always lower than that of the hybrid mode with 8 
threads. The outcome is the same for both strong and weak scaling, asserting 
that fiat MPI is the best mode in both machines at this calculation scale. 

Finally, in weak scaling on XT5 with both hybrid and fiat MPI modes 
and on FXIO with the fiat MPI mode, when the number of atoms per core 
is increased by a factor of 2 from 16 to 32, the elapsed time also increases 
almost exactly by a factor of 2 as expected. The results again confirm the 
linear scaling property of OpenMX, and that it can maintain the same level 
of efiiciency for larger calculation scales, as demonstrated on the K computer. 

5.2.4' Strong scaling property with DNA 

In the diamond case, the structure is uniform, and the numbers of atoms 
and cores are selected so that each core is assigned the same computational 
amount. In the case of DNA, by contrast, the structure is long, and the 
number of atoms is not divisible by the number of cores. Even so, our 
decomposition scheme is still able to work effectively for the DNA structure. 



as shown in Fig. 17, where the parallel efiiciency is calculated by Eq. (20) 
with the baseline being 256 cores in Fig. [T7|(a) and 1,024 cores in Fig. |17|(b). 
With 2,600 atoms and up to 2,048 cores on the K computer (hybrid with 8 
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Figure 17: Strong scahng property with the DNA structure. 
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threads) and FXIO (flat MPI) (Fig. [T7](a)), the paraUel eflaciency at 2,048 
cores is 63.4% in the hybrid mode, and drops to only 42.9% in the flat MPI 
mode. The efliciency gap is due to a load imbalance caused by the difference 
in the number of processes in these modes. In the hybrid mode, the number 
of processes at 2,048 cores is only 256, as against 2,048 in the flat MPI mode. 
Assume that the atoms have the same weight of 1. Because there are 2,600 
atoms, each process is allocated 10 or 11 atoms in the hybrid mode, and 1 
or 2 atoms in the flat MPI mode. Obviously, the load imbalance of the flat 
MPI mode is 100%, much higher than 10% of the hybrid mode, resulting in 
the low efl&ciency. 

Figure [T7|(b) compares the elapsed time and parallel efliciency of 13,000 
and 26,000 atoms in the hybrid mode on the K computer with up to 8,192 
cores. By observing the elapsed time at each number of cores, we can easily 
recognize the better-than-linear behavior of our Krylov subspace method, 
where the elapsed time of 26,000 atoms is nearly twice as long as that of 
13,000 atoms. In addition, the parallel efliciency of 26,000 atoms is much 
higher than that of 13,000 atoms, except for 4,096 cores, for example, 70.8% 
compared to 57.2% at 8,192 cores. The result is attributed to the compu- 
tational amount per core that is doubled in case of 26,000 atoms, leading 
to a higher computation to communication ratio and eventually a higher 
efl&ciency. 

5.2.5. Strong scaling property of subroutines 

So far we have discussed only the total elapsed time, which should be bro- 
ken down further to understand the scaling behavior of the subroutines in 



DFT. Figure [18] plots the strong scaling property of the subroutines, taken 
with 16,384 diamond atoms in the flat MPI mode on XT5. Some impor- 
tant and time-consuming subroutines are noted below in order of descending 
elapsed time. 

• Diagonalization(): is a subroutine for performing the diagonalization, 
which is actually the 0(N) Krylov subspace method in the benchmark 
series. It accounts for 48.7% of the total time, and scales well, reaching 
a speedup of 7.3 at 2,048 cores compared to the ideal speedup of 8. 



Set_Nonlocal(): is a subroutine for calculating the matrix elements and 
its derivatives for nonlocal potentials in the momentum space, account- 
ing for 19.6% of the total time. The speedup at 2,048 cores is 6.5, also 
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Figure 18: Strong scaling property of subroutines. The results are taken with 16,384 
diamond atoms, flat MPI on XT5. 



good considering that fact that MPI communication is incurred in this 
subroutine. 

• Force(): As the name reveals, this subroutine calculates the force on 
atoms and contributes 9.4% to the total time. This is the worst scaling 
subroutine among the time-consuming ones, as the speedup is only 3.2, 
less than half of the ideal speedup of 8, caused by a large amount of MPI 
communication among the processes. This amount can be dramatically 
reduced if buffer regions are added to store more data, with the tradeoff 
of incurring high memory usage. Nevertheless, the subroutine is called 
only once in each MD step, and hence, becoming smaller in calculations 
that require a large number of SCF iterations. We are now in the 
process of tuning this subroutine to make it scale better. 

• TotaLEnergy(): is a subroutine for calculating the total system en- 
ergy, accounting for 8.3% of the total time. It exhibits a good scaling 
property with the speedup of 6.8 at 2,048 cores. 

• Set_OLP_Kin(): is a subroutine for calculating the overlap matrix and 
the matrix for the kinetic operator in the momentum space. It accounts 
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for 6.5% of the total time and also scales well with a speedup of 6.7. 

• Poisson(): is a subroutine for solving the Poisson's equation using FFT. 
Although its contribution to the total time is trivial, approximately 
0.02%, we note it here as it is directly related to our method for grid 
decomposition. 

In summary, all the time-consuming subroutines, except for Force(), which 
contribute more than 5% to the total time, scale well with the speedup of 6.5 
to 7.3, as against the ideal speedup of 8. As a result, the total time shows a 
good scaling property as can be seen in previous subsections. 

5.2.6. Memory usage 




Figure 19: Memory usage per process with the diamond structure, measured with 1,024 
and 2,048 MPI processes in flat MPI mode on XT5. 



Memory management is another crucial aspect that must be handled 
properly in large-scale calculations. As touched on above, we give priority to 
efficient use of memory and therefore let the processes carry out on-the-fly 
communications instead of extending the buffer to store more data in our 
implementation. Figure [19] displays the amount of memory per process used 
in the calculations for 16,384, 32,768, and 65,536 diamond atoms, measured 
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with 1,024 and 2,048 MPI processes in the flat MPI mode on XT5. The 
numbers of grids of a-, 6-, and c-axes are 420, 420, and 210 for 16,384 atoms, 
420, 420, and 420 for 32,768 atoms, and 840, 420, and 420 for 65,536 atoms. 
With the number of processes fixed at 1,024 or 2,048 and the number of 
atoms increased hnearly, the scahng property of memory usage is well below 
linear, with the scaling factors at 65,536 atoms are 2.7 (1,024 processes) and 
3.0 (2,048 processes), as against the linear factor of 4. On the other hand, 
when the number of processes is doubled from 1,024 to 2,048, the memory 
usage efficiency factors are 78.4%, 82.4%, and 87.6% for 16,384, 32,768, and 
65,536 atoms, respectively. 

6. Conclusion 

We have developed a three-dimensional domain decomposition scheme 
that is applicable to both conventional DFT methods and linear scaling DFT 
methods, consisting of two methods: one for atom decomposition and the 
other for grid decomposition. The atom decomposition method reorders the 
atoms along a principal axis based on a modified recursive bisection method 
and inertia tensor moment. The atoms that are close in real space are also 
close on the principal axis to maintain data locality. The atoms are then di- 
vided into sub-domains in a balanced way among the processes. On the other 
hand, the grid decomposition method makes it easier for the calculations of 
charge density and other potentials by defining different data structures for 
storing the grid data. Also, our proposed decomposition method for solving 
the Poisson equation using FFT in the calculation of Hartree potential ex- 
hibits up to 77.8% enhancement of communication efficiency in comparison 
to a previously proposed method with 64 x 64 x 64 grid points. Our scheme 
has been evaluated with OpenMX and the linear scaling Krylov subspace 
method, and demonstrates good strong and weak scaling properties. On the 
K computer, the parallel efficiency at 131,072 cores is 67.7% compared to the 
baseline of 16,384 cores with 131,072 diamond atoms. The efficiency is from 
68.8% to 96.4% on XT5, and from 59.1% to 94.1% on FXIO with up to 2,048 
cores, depending on the number of cores in use. The results suggest that 
our scheme is efficient for enabling large-scale electronic calculations based 
on DFT on massively parallel computers. 

We are now in the process of tuning the performance of some particular 
subroutines, especially Force(), the worst scaling subroutine among the time- 
consuming ones. Should it be improved, certainly we could further accelerate 
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the calculations and achieve a higher parallel efficiency. 
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